Unsupervised Learning of Long-Term Motion Dynamics for Videos
We present an unsupervised representation learning approach that compactly
encodes the motion dependencies in videos. Given a pair of images from a video
clip, our framework learns to predict the long-term 3D motions. To reduce the
complexity of the learning framework, we propose to describe the motion as a
sequence of atomic 3D flows computed with RGB-D modality. We use a Recurrent
Neural Network based Encoder-Decoder framework to predict these sequences of
flows. We argue that in order for the decoder to reconstruct these sequences,
the encoder must learn a robust video representation that captures long-term
motion dependencies and spatial-temporal relations. We demonstrate the
effectiveness of our learned temporal representations on activity
classification across multiple modalities and datasets such as NTU RGB+D and
MSR Daily Activity 3D. Our framework is generic to any input modality, i.e.,
RGB, Depth, and RGB-D videos.
Comment: CVPR 201
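The encoder-decoder idea described above can be sketched minimally: an encoder maps a feature for the pair of frames to a hidden state, and a recurrent decoder unrolls that state into a sequence of predicted flow vectors. All shapes, parameter names, and the plain tanh recurrence below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def encode_decode_flows(frame_pair_feat, steps, params):
    """Toy encoder-decoder sketch (hypothetical shapes and parameters).

    frame_pair_feat: 1-D feature vector for a pair of video frames.
    steps: number of atomic flows to predict.
    params: (W_enc, W_h, W_out) weight matrices.
    """
    W_enc, W_h, W_out = params
    h = np.tanh(frame_pair_feat @ W_enc)   # encoder: compact video representation
    flows = []
    for _ in range(steps):
        h = np.tanh(h @ W_h)               # one recurrent decoder step
        flows.append(h @ W_out)            # predicted atomic flow at this step
    return np.stack(flows)                 # shape: (steps, flow_dim)
```

The point of the setup is that the decoder can only reconstruct the whole flow sequence if the encoder's hidden state already captures the long-term motion, which is what makes the representation useful downstream.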
Towards Vision-Based Smart Hospitals: A System for Tracking and Monitoring Hand Hygiene Compliance
One in twenty-five patients admitted to a hospital will suffer from a
hospital-acquired infection. If we can intelligently track healthcare staff,
patients, and visitors, we can better understand the sources of such
infections. We envision a smart hospital capable of increasing operational
efficiency and improving patient care with less spending. In this paper, we
propose a non-intrusive vision-based system for tracking people's activity in
hospitals. We evaluate our method for the problem of measuring hand hygiene
compliance. Empirically, our method outperforms existing solutions such as
proximity-based techniques and covert in-person observational studies. We
present intuitive, qualitative results that analyze human movement patterns and
conduct spatial analytics which convey our method's interpretability. This work
is a step towards a computer-vision-based smart hospital and demonstrates
promising results for reducing hospital-acquired infections.
Comment: Machine Learning for Healthcare Conference (MLHC
Differentially Private Video Activity Recognition
In recent years, differential privacy has seen significant advancements in
image classification; however, its application to video activity recognition
remains under-explored. This paper addresses the challenges of applying
differential privacy to video activity recognition, which primarily stem from:
(1) a discrepancy between the desired privacy level for entire videos and the
nature of input data processed by contemporary video architectures, which are
typically short, segmented clips; and (2) the complexity and sheer size of
video datasets relative to those in image classification, which render
traditional differential privacy methods inadequate. To tackle these issues, we
propose Multi-Clip DP-SGD, a novel framework for enforcing video-level
differential privacy through clip-based classification models. This method
samples multiple clips from each video, averages their gradients, and applies
gradient clipping in DP-SGD without incurring additional privacy loss.
Moreover, we incorporate a parameter-efficient transfer learning strategy to
make the model scalable for large-scale video datasets. Through extensive
evaluations on the UCF-101 and HMDB-51 datasets, our approach exhibits
impressive performance, achieving 81% accuracy with a privacy budget of
epsilon=5 on UCF-101, marking a 76% improvement compared to a direct
application of DP-SGD. Furthermore, we demonstrate that our transfer learning
strategy is versatile and can enhance differentially private image
classification across an array of datasets including CheXpert, ImageNet,
CIFAR-10, and CIFAR-100.
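The core step of the method described above, averaging the gradients of multiple clips per video before per-video clipping and noising, can be sketched as follows. This is a simplified numpy illustration of the idea, not the paper's implementation; the function name, shapes, and hyperparameter defaults are assumptions.

```python
import numpy as np

def multi_clip_dp_sgd_step(per_clip_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    """One noisy gradient estimate in the spirit of Multi-Clip DP-SGD (sketch).

    per_clip_grads: list over videos; each entry is an array of shape
    (num_clips, dim) holding the gradient from each clip sampled from that video.
    Clip gradients are averaged per video BEFORE clipping, so each video
    contributes a single bounded gradient and sampling more clips incurs
    no additional privacy loss.
    """
    rng = np.random.default_rng(rng)
    clipped = []
    for g in per_clip_grads:
        v = g.mean(axis=0)                             # average clip gradients for this video
        scale = min(1.0, clip_norm / (np.linalg.norm(v) + 1e-12))
        clipped.append(v * scale)                      # per-video gradient clipping
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return (total + noise) / len(per_clip_grads)       # noisy average over videos
```

Because clipping happens at the video level, the sensitivity of the summed gradient to any single video stays bounded by `clip_norm`, which is what allows standard DP-SGD accounting to apply unchanged.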
Alleviating Human-level Shift: A Robust Domain Adaptation Method for Multi-person Pose Estimation
Human pose estimation has been widely studied, with much of the focus on
supervised learning that requires sufficient annotations. In real
applications, however, a pretrained pose estimation model usually needs to be
adapted to a novel domain with no labels or only sparse labels. Such domain
adaptation for 2D pose estimation has not been explored. The main reason is
that a pose, by nature, has a typical topological structure and requires
fine-grained features at local keypoints, whereas existing adaptation methods
ignore the topological structure of the object of interest and align whole
images only coarsely. Therefore, we
propose a novel domain adaptation method for multi-person pose estimation to
conduct the human-level topological structure alignment and fine-grained
feature alignment. Our method consists of three modules: Cross-Attentive
Feature Alignment (CAFA), Intra-domain Structure Adaptation (ISA) and
Inter-domain Human-Topology Alignment (IHTA) module. The CAFA adopts a
bidirectional spatial attention module (BSAM) that focuses on fine-grained local
feature correlation between two humans to adaptively aggregate consistent
features for adaptation. We adopt ISA only in semi-supervised domain adaptation
(SSDA) to exploit the corresponding keypoint semantic relationship for reducing
the intra-domain bias. Most importantly, we propose IHTA to learn a more
domain-invariant human topological representation that reduces the
inter-domain discrepancy. We model the human topological structure via a
graph convolutional network (GCN); by passing messages over this graph,
high-order relations can be taken into account. This structure-preserving
alignment based on the GCN benefits inference for occluded or extreme poses.
Extensive experiments on two popular benchmarks demonstrate that our method
is competitive with existing supervised approaches.
Comment: Accepted by ACM MM'202
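The message passing over the human topology mentioned above can be illustrated with a single graph-convolution layer over a keypoint graph. This is a generic GCN layer with symmetric normalization and self-loops, not the paper's exact formulation; the adjacency, feature, and weight shapes are illustrative.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution layer over a keypoint graph (generic sketch).

    adj: (J, J) binary adjacency of the skeleton (J joints).
    feats: (J, d_in) per-joint features.
    weight: (d_in, d_out) learnable linear transform.
    Each joint aggregates normalized features from itself and its neighbors,
    so stacking layers propagates high-order relations along the skeleton.
    """
    a = adj + np.eye(adj.shape[0])           # add self-loops
    d = a.sum(axis=1)
    a_norm = a / np.sqrt(np.outer(d, d))     # D^-1/2 (A + I) D^-1/2
    return np.maximum(0.0, a_norm @ feats @ weight)  # aggregate, transform, ReLU
```

Because neighbors exchange information at every layer, an occluded joint's representation is supported by its visible neighbors, which is the intuition behind structure-preserving alignment helping with occluded or extreme poses.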